Characterizing Information Diets of Social Media Users

Authors

  • Juhi Kulshrestha
  • Muhammad Bilal Zafar
  • Lisette Espin Noboa
  • Krishna P. Gummadi
  • Saptarshi Ghosh
Abstract

With the widespread adoption of social media sites like Twitter and Facebook, there has been a shift in the way information is produced and consumed. Earlier, the only producers of information were traditional news organizations, which broadcast the same carefully-edited information to all consumers over mass media channels. Now, in online social media, any user can be a producer of information, and every user selects which other users she connects to, thereby choosing the information she consumes. Moreover, the personalized recommendations that most social media sites provide also contribute towards the information consumed by individual users. In this work, we define a concept of information diet – which is the topical distribution of a given set of information items (e.g., tweets) – to characterize the information produced and consumed by various types of users on the popular Twitter social media platform. At a high level, we find that (i) popular users mostly produce very specialized diets focusing on only a few topics; in fact, news organizations (e.g., NYTimes) produce much more focused diets on social media as compared to their mass media diets, (ii) most users’ consumption diets are primarily focused towards one or two topics of their interest, and (iii) the personalized recommendations provided by Twitter help to mitigate some of the topical imbalances in the users’ consumption diets, by adding information on diverse topics apart from the users’ primary topics of interest.

Introduction

The rapid adoption of social media sites like Twitter and Facebook is bringing profound changes in the ways information is produced and consumed in our society. Traditionally, people acquired information about world events via mass media, i.e., dedicated news organisations that relied on some broadcast medium like print (NYTimes or Economist), radio (NPR, BBC radio), or television (CNN, ESPN) to disseminate the information to large numbers of users.
Mass media communications are characterised by (i) a small number (few tens to a few hundreds) of news organisations controlling what hundreds of millions of users consume, (ii) an expert team of editors at each news organisation carefully vetting and selecting news stories to ensure a balanced coverage of important news stories, and (iii) all consumers receiving the same standardised information broadcast by each mass media source.

Copyright © 2015, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

In contrast to the organised world of information production and consumption in broadcast mass media, online social media sites like Twitter and Facebook offer a chaotic information marketplace for millions of producers and consumers of information. Unlike mass media, in social media (i) any of the hundreds of millions of users of these systems can be a producer as well as a consumer of information, (ii) these individual users are not expected to provide a balanced coverage of news-stories – they publish any information that they deem important or necessary to share with their friends in real-time, and (iii) information consumption is personalised and not all users consume the same information – every individual user selects (e.g., by establishing social links) her preferred sources of information from the millions of individual producers, and recommender systems deployed by social media platforms provide an additional source of information to the user. Thus, individual social media users might receive information that is not only unbalanced in terms of coverage of news-stories, but is also very different from what other users in the system receive.

An entire discipline, media studies, has largely focused on analysing the coverage of information published on broadcast mass media and how it impacts the consumers of mass media.
In contrast, research on understanding the composition of information produced and consumed by social media users is still in its infancy, being limited to a few macroscopic studies on the amounts of information posted by broad categories of users (e.g., celebrities) (Wu et al. 2011; Kwak et al. 2010). There has not been much work on analysing the composition of the information produced or consumed by users at the granularity of individual messages.

In this paper, we take the first step towards addressing this challenge by defining the notion of information diet. Similar to diet in nutrition, the information diet of a user refers to the composition of all the information consumed or produced by the user (Johnson 2012). Specifically, we focus on the topical composition of users’ diets, i.e., the fraction of their information diets that corresponds to different topical categories of information (e.g., information on politics, sports, entertainment, and so on).

One of our key goals is to better understand how the differences in information production and consumption processes between broadcast mass media and online social media affect users’ diets. So we conducted a comparative analysis of the topical compositions of the information diets produced, consumed, and recommended on social media and the mass media. Our investigation focused on the following three high-level questions:

1. Production: What is the topical composition of information published on broadcast mass media (e.g., NYTimes print edition)? How does the information produced by social media accounts compare with the information published on mass media?
2. Consumption: How balanced or unbalanced are consumption diets of social media users (relative to mass media diet)? Are users’ consumption diets heavily skewed towards a few topics of their interest, or do they also tend to receive information on a broad variety of topics covered in mass media?
3. Recommendations: Do personalised recommender systems deployed by the social media platform provide balanced or unbalanced diets (relative to mass media) to social media users? Do they mitigate or exacerbate the imbalances in the users’ consumption diets?

We attempt to address the above questions in the context of the Twitter social media platform. To conduct our study, we needed a methodology to infer the topics of individual posts on Twitter. The bounded length of tweets makes it challenging to infer topics at the level of individual tweets. We propose a novel methodology to infer the topic of a post by leveraging the topical expertise of the Twitter users who have posted it. To obtain the information about users’ topical expertise, we leverage a methodology based on Twitter Lists, developed in our prior works (Ghosh et al. 2012; Sharma et al. 2012). We show that our methodology performs better at inferring topics for posts than a state-of-the-art publicly deployed commercial topic inference system.

Our study conducted using the above methodology yields several key insights. We highlight a few below:

1. Mass media sources cover a wide range of topics from politics and business to entertainment and health. But on social media, the individual sources of information are very focused and publish information dominated by a few topics. It is up to the social media users to select sources to obtain a balanced diet for themselves.
2. We find that for most users, a large fraction of their consumed diet comes from as few as one or two topics, and they hear very little about other niche topics like health and environment (unless they are interested in these topics).
3. We find that social recommendations, i.e., recommendations about information popular in a user’s social network neighbourhood (Gupta et al. 2014), often do not match the user’s preferred diet.
The differences between recommended and consumed diets are likely due to differences in the interests of a user and the interests of her network neighbours. As a result, social recommendations introduce topical diversity to a user’s diet and can help balance its topical composition.

We have publicly deployed a Web-based service for measuring the information diets produced and consumed by Twitter users, at http://twitter-app.mpi-sws.org/information-diets/.

Our work and findings have a number of important implications. As social media becomes more popular, it is important to raise awareness about the balance or imbalance in information diets produced and consumed on social media. Our findings raise the need for better information curators (human editors or automated recommendation systems) on social media that provide a more balanced information diet. Finally, our work is an early attempt, and much future work remains to be done both on understanding the impact of the diets on consumers in shaping their opinions and on other ways of quantifying the diets beyond topical composition.

Related Work

Analysis of content on mass media: Media studies has been an active field which analyzes the content coverage on mass media, and its effects on society (see footnote 1). There exist a number of ‘media watchdog organizations’ (e.g., FAIR (http://fair.org/), AIM (http://www.aim.org/)) which judge the content covered by news organizations based on fairness, balance and accuracy. Additionally, there have also been studies on media biases (Groseclose and Milyo 2005; Budak, Goel, and Rao 2014). Such studies are easier to perform over mass media since it is a broadcast medium and all users receive the same information. On the other hand, studying the information consumed on social media is much more challenging since individual users shape their own personalized channels of information by selecting the other users to follow.
Information production & consumption on social media: Prior studies on information production and consumption on social media (Wu et al. 2011; Kwak et al. 2010; Cha et al. 2012) have been limited to studying the amount of information being exchanged among various users. There has not been any notable effort towards analyzing the topical composition of the information produced or consumed, which is the goal of this work. There have also been some prior works on whether social media users are receiving multiple perspectives on a specific event or topic (Balasubramanyan et al. 2012; Conover et al. 2011; Park et al. 2009; Adamic and Glance 2005; Borge-Holthoefer et al. 2015). Though we focus only on the topical composition of the information produced and consumed by social media users, the concept of information diet introduced in this work can be extended to study opinion polarization on social media.

Topic inference of social media posts: To our knowledge, all prior attempts to infer the topic of a tweet / hashtag / trending topic rely on the content itself – either applying NLP and ML techniques (Quercia, Askham, and Crowcroft 2012; Ramage, Dumais, and Liebling 2010; Ottoni et al. 2014; Zubiaga et al. 2011) or mapping to external sources such as Wikipedia or Web search results (Meij, Weerkamp, and de Rijke 2012; Bernstein et al. 2010) – in order to infer the topics. Such methodologies are of limited utility in the case of social media like Twitter, primarily due to the tweets being too short, and the informal nature of the language used by most users (Sharma et al. 2012; Wagner et al. 2012). In contrast to these previous approaches which focus on the content, our methodology focuses on the characteristics of the authors of the content to infer its topic.

Footnote 1: http://en.wikipedia.org/wiki/Media_studies

Methodology: Quantifying Information Diets

In this paper, we introduce the notion of information diet of a set of information items (e.g.
a set of tweets or hashtags), as the topical composition of the information items. We define the topical composition over a given set of topics as the fraction of information related to each topic. In this section, we present our methodology for quantifying the information diet for a set of tweets on Twitter.

We chose hashtags and URLs as the basic elements of information in a tweet and collectively refer to them as keywords. However, our methodology can be easily extended to include other kinds of keywords such as named entities. To justify our choice of keywords, we conducted a survey through Amazon Mechanical Turk (AMT: https://www.mturk.com/), where we showed workers 500 randomly selected tweets from Twitter’s 1% random sample which did not contain any keyword. A majority of the AMT workers judged 96% of the tweets without any keywords to be non-topical, i.e., they mostly contained conversational babble. Thus, the hashtags and URLs contain crucial signals about the topicality of tweets, justifying our decision to only consider hashtags and URLs as keywords for inferring the topic of tweets.

The key step in our methodology for quantifying information diets consists of inferring the topic of a keyword, which is described next.

Inferring topic of a keyword

As discussed in the Related Work section, prior approaches for inferring the topic of a tweet / keyword rely on the content itself. Such approaches tend to perform poorly on short posts containing informal language (Sharma et al. 2012; Wagner et al. 2012). So we propose a different technique to infer the topic of a keyword which relies on the topical expertise of the users who are discussing that keyword. The basic intuition behind our technique is that if many users interested in a certain topic are discussing a particular keyword, that keyword is most likely related to that topic.
To identify the topical expertise of users in Twitter, we leveraged the List-based methodology developed in our prior works (Sharma et al. 2012; Ghosh et al. 2012) to retrieve expertise tags for topical experts. For instance, some of the tags inferred by this methodology for the expert @ladygaga are ‘music’, ‘entertainment’, ‘singers’, ‘celebs’ and ‘artists’. We extracted topical expertise of 771,000 experts on Twitter by using this methodology. The details of the methodology are omitted here for brevity.

Next, we used two standard topical hierarchies – the Open Directory Project (www.dmoz.org) and AlchemyAPI (www.alchemyapi.com/api/taxonomy/) – to obtain 18 topical categories and their related terms, as shown in Table 1. The 18 topical categories were selected by combining the top categories of the two hierarchies, while the related terms were derived from their lower levels.

Table 1: The 18 topic-categories to which keywords / tweets will be mapped, and some terms related to each topic. The terms will be matched with expertise-tags.

Topic category      Some related terms
Arts-crafts         art, history, geography, theater, crafts, design
Automotive          vehicles, motorsports, bikes, cars
Business-finance    retail, real-estate, marketing, economics
Career              jobs, entrepreneurship, human-resource
Education-books     books, libraries, teachers, school
Entertainment       music, movies, tv, radio, comedy, adult
Environment         climate, energy, disasters, animals
Fashion-style       style, models
Food-drink          food, wine, beer, restaurants, vegan
Health-fitness      disease, mental-health, healthcare
Hobbies             photography, tourism, gardening
Paranormal          astrology, supernatural
Politics-law        politics, law, military, activism
Religion            christianity, islam, hinduism, spiritualism
Science             physics, chemistry, biology, mathematics
Society             charity, LGBT
Sports              football, baseball, basketball, cricket
Technology          mobile-devices, programming, web-systems
In the rest of the paper, we quantify information diets by inferring the fraction of information from each of these 18 topics. We also mapped the experts to one or more of the 18 topic categories, by matching the inferred tags of each expert to the related terms of the topical categories.

As stated earlier, the main intuition behind our methodology is that if several experts on a topic are posting a keyword, then that keyword is most likely related to that topic. To infer the topic of a keyword k, we first identify the set of experts E_k who have posted k. We do not attempt to infer the topic of a keyword unless it has been posted by at least 10 of our identified experts. For each topic t (in Table 1), we then determine the fraction (f_t) of experts in E_k who are mapped to that topic t. Next, to account for the varying number of experts mapped to different topics, we normalize the fraction f_t by the total number of experts on topic t in our data set. Finally, we select the topic with the highest normalized fraction f_t to be the inferred topic of keyword k. Further details of the methodology can be found at http://twitter-app.mpi-sws.org/information-diets/.

Evaluating the topic inference methodology

We now present the evaluation of the performance of our proposed topic inference methodology, and compare its performance with that of a state-of-the-art commercial service, AlchemyAPI, that uses NLP and deep-learning techniques for topic inference. We found the performances to be very similar for both hashtags and URLs; hence, for brevity, we only present the evaluation results for hashtags.

Table 2: Comparing the proposed topic inference methodology with AlchemyAPI (which uses NLP techniques) in terms of coverage and accuracy.

Metric      Methodology    Popular hashtags    Random hashtags
Coverage    AlchemyAPI     22.5%               55.5%
Coverage    Proposed       98%                 82.5%
Accuracy    AlchemyAPI     44.44%              51.35%
Accuracy    Proposed       58.67%              49.69%
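The keyword topic-inference steps described above (collect the experts who posted a keyword, require at least 10 of them, compute the per-topic fraction f_t, normalize by the number of experts on each topic, take the maximum) can be sketched in Python roughly as follows. This is only an illustrative sketch, not the authors' implementation: the function name `infer_topic`, the `expert_topics` dictionary layout, and the tie-breaking behaviour are assumptions introduced here.

```python
from collections import Counter

MIN_EXPERTS = 10  # threshold stated in the text


def infer_topic(posting_users, expert_topics, min_experts=MIN_EXPERTS):
    """Return the inferred topic of one keyword, or None if it cannot be inferred.

    posting_users: iterable of user ids that posted the keyword.
    expert_topics: dict mapping expert id -> set of topic categories
                   (hypothetical layout; an expert may map to several topics).
    """
    # E_k: the identified experts who posted the keyword.
    experts = {u for u in posting_users if u in expert_topics}
    if len(experts) < min_experts:
        return None

    # Total number of experts mapped to each topic in the whole data set.
    topic_totals = Counter(t for topics in expert_topics.values() for t in topics)

    # Number of experts in E_k mapped to each topic.
    counts = Counter(t for e in experts for t in expert_topics[e])

    # f_t = counts[t] / |E_k|, then normalized by the topic's expert count.
    scores = {t: (c / len(experts)) / topic_totals[t] for t, c in counts.items()}

    # Ties are broken arbitrarily here (an assumption; the text does not say).
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Toy data: 12 sports experts and 100 politics experts.
    expert_topics = {f"s{i}": {"Sports"} for i in range(12)}
    expert_topics.update({f"p{i}": {"Politics-law"} for i in range(100)})
    posters = [f"s{i}" for i in range(6)] + [f"p{i}" for i in range(6)]
    print(infer_topic(posters, expert_topics))
```

In the toy data, both topics have the same raw fraction (6 of 12 posting experts), but the normalization favours the topic with fewer experts overall, which is the effect the normalization step is meant to have.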
The set of hashtags used for evaluation is derived from the Twitter 1% random sample (see footnote 2) from a week in December 2014. It consists of: (i) 200 popular hashtags which were most tweeted, and (ii) 200 randomly selected hashtags. We inferred the topic of a hashtag using AlchemyAPI by passing 1000 randomly selected tweets containing the hashtag. Table 2 compares the performance of the proposed methodology with AlchemyAPI, based on two metrics: coverage and accuracy.

Coverage: It is defined as the fraction of keywords for which a methodology is able to infer a topic. Table 2 shows that our proposed methodology performs significantly better than AlchemyAPI, which possibly fails due to the informal and abbreviated language used in most tweets. Note that our methodology is able to infer topics for a relatively smaller fraction of random hashtags than the popular ones, since we need the hashtag to be posted by at least 10 experts.

Accuracy: It is defined as the fraction of keywords for which the inferred topic is relevant. Relevance was judged through an AMT survey – we showed the hashtag, 20 random tweets containing the hashtag, and the inferred topic to five AMT workers and asked them to judge if the inferred topic of the hashtag is relevant. Table 2 shows the majority opinion of the five workers – the proposed methodology is accurate for a larger fraction of popular hashtags, while AlchemyAPI performs slightly better for randomly selected hashtags.

Overall, our proposed methodology performs better than a state-of-the-art NLP-based technique in inferring topics of hashtags, especially for popular ones – not only does the proposed methodology infer topics for more hashtags, but also the inferred topics are more accurate.

Quantifying information diet of social media posts

Having established the methodology to infer the topic of a keyword, we now use it to construct the information diet of a set of tweets.
We first extract the keywords from every tweet in the set and infer the topic of each individual keyword. We then construct a topic-vector for the given set of tweets, where the weight of a topic is the total contribution of all keywords inferred to be on that topic. Since a tweet can contain multiple keywords, we normalize the contribution of each keyword within a tweet by the number of keywords in that tweet (so that each tweet contributes a total weight of 1 to the topic-vector). This topic-vector represents the information diet of the given set of tweets.

Footnote 2: We considered only English tweets, i.e., tweets in which at least half of the words occur in a standard English dictionary.

Limitations of our methodology

We briefly discuss some limitations in our approach of quantifying the information diets of users. First, since we infer the topics of only those keywords which have been tweeted by at least 10 topical experts, we have a lower coverage and accuracy for non-popular keywords. However, the later sections show that the popular information forms a large fraction of users’ diets; hence, the approach is likely to be able to estimate the information diets of users fairly accurately. Second, while we only focus on information that a user posts or consumes on Twitter, we are aware that a user in Twitter is also likely to get information from other online as well as off-line sources. However, as users are relying more and more on social media sites such as Twitter and Facebook to find interesting information (Jane Sasseen et al. 2013), what a user consumes in Twitter is likely to be an increasingly significant factor in shaping her overall information diet.

Mass Media Diet

As mentioned earlier, the goal of this study is to compare and contrast the processes of production and consumption of information over broadcast mass media and over social media.
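The topic-vector construction described in the subsection "Quantifying information diet of social media posts" can be illustrated with a short Python sketch. It is a sketch under stated assumptions rather than the authors' code: the regular expression used to extract hashtags and URLs, the function names, the choice to drop the weight of keywords with no inferred topic, and the final rescaling of the vector into fractions are all assumptions made here for illustration.

```python
import re
from collections import defaultdict

# Hypothetical keyword extractor: hashtags and http(s) URLs only.
KEYWORD_RE = re.compile(r"#\w+|https?://\S+")


def extract_keywords(tweet_text):
    """Return the hashtags (lower-cased) and URLs found in a tweet."""
    found = KEYWORD_RE.findall(tweet_text)
    return [k.lower() if k.startswith("#") else k for k in found]


def information_diet(tweets, keyword_topic):
    """Build the topic-vector (diet) of a set of tweets.

    tweets: iterable of tweet texts.
    keyword_topic: dict mapping keyword -> inferred topic (hypothetical input,
                   e.g. the output of a keyword topic-inference step).
    """
    weights = defaultdict(float)
    for text in tweets:
        keywords = extract_keywords(text)
        if not keywords:
            continue  # tweets without keywords carry no topical signal
        share = 1.0 / len(keywords)  # each tweet contributes total weight 1
        for k in keywords:
            topic = keyword_topic.get(k)
            if topic is not None:  # assumption: uninferred keywords are dropped
                weights[topic] += share
    total = sum(weights.values())
    # Assumption: report the diet as fractions that sum to 1.
    return {t: w / total for t, w in weights.items()} if total else {}


if __name__ == "__main__":
    tweets = [
        "Great match #NBA #election",
        "Read http://example.com/a",
        "just chatting",
    ]
    keyword_topic = {
        "#nba": "Sports",
        "#election": "Politics-law",
        "http://example.com/a": "Politics-law",
    }
    print(information_diet(tweets, keyword_topic))
```

In the toy input, the first tweet splits its unit weight equally between two keywords, the second gives its full weight to one keyword, and the third contributes nothing because it has no keyword.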
We analyze the information being published over mass media by three popular news organizations – NYTimes, Washington Post and The Economist. We collected their broadcast print editions for three days in December 2014, and categorized the news-articles into our 18 topic-categories (Table 1) through human feedback. Each news-article was shown to five distinct workers recruited through AMT, and the majority verdict was considered as the topic for the news-article.

Table 3 shows the mass media information diets of the three news organizations. We find that all the news organizations tend to focus (i.e., publish the majority of their news-articles) on a few popular topics – politics, entertainment, and sports for NYTimes and Washington Post, and mainly politics and business-finance for The Economist. However, despite their bias towards these few popular topics, the mass media diets also have a spread over the remaining less popular topics – the 12 least popular topics contribute 25% of the diet for NYTimes and 17% for both Washington Post and Economist. In the following sections, we use these mass media information diets as a baseline for comparing with various information diets on social media.

Table 3: Mass media information diets of three news organizations, where the topics of the news-articles were judged by AMT workers.

Topic               NYTimes    Wash. Post    Economist
Arts-Crafts         4.56%      0.0%          1.85%
Automotive          1.34%      0.0%          0.37%
Business-Finance    7.51%      8.65%         28.04%
Career              0.8%       0.48%         0.74%
Education-Books     1.88%      5.29%         3.32%
Entertainment       12.33%     13.94%        1.48%
Environment         3.49%      0.96%         7.01%
Fashion-Style       0.0%       1.44%         0.0%
Food-Drink          4.83%      6.25%         2.21%
Health-Fitness      6.17%      5.29%         2.95%
Hobbies-Tourism     1.34%      0.0%          0.37%
Paranormal          0.27%      0.0%          0.0%
Politics-Law        29.49%     37.5%         35.06%
Religion            2.14%      0.96%         2.95%
Science             1.34%      0.96%         2.58%
Society             3.75%      6.73%         3.32%
Sports              15.01%     9.62%         1.11%
Technology          3.75%      1.92%         6.64%

Production: Social vs. Mass Media Diets

Traditionally, in mass media, editors of news-organizations are expected to ensure that the news-stream has a balanced coverage across various topics of interest of the subscribers, by following definite guidelines. In contrast, every user-account in social media serves as a producer / source of information, and there are no definite guidelines on the content being posted by any account. To analyze the effects of these differences, this section compares various information diets being produced in social media with those of mass media (described in the previous section).

News organizations: Social media vs. mass media

We first address the question: are there differences between the information diets published by news organizations over mass media and social media? To answer this question, we collected the tweets posted by the Twitter accounts of the three selected news organizations (NYTimes, Washington Post and The Economist) during December 2014, and generated the information diet produced by these news organizations over social media (see footnote 3).

Interestingly, we find that each of the three news organizations has multiple accounts on Twitter. These include one primary account (@nytimes, @washingtonpost and @economist) and several topic-specific accounts (e.g., @NYTSports, @EconSciTech, @PostHealthSci) each of which specializes in posting news-stories on a particular topic. Table 4 shows some of the topic-specific accounts of the three news organizations, along with the fraction of their production diet that is on the topic of specialization.
It is evident that the topic-specific accounts produce a much larger fraction of their diet on their specific topics of specialization, as compared to the mass media diet of the same news organization.

Table 4: Examples of topic-specific Twitter accounts of news organisations, along with the contribution of their topics of specialization in their production diet.

Social media account    Topic of specialization    Contribution of topic (social media)    Contribution of topic (mass media)
NYTSports               Sports                     66.6%                                   15.0%
nytimesbusiness         Business                   66.1%                                   7.5%
nytimesbooks            Edu-Books                  59.1%                                   1.9%
EconUS                  Business                   74.4%                                   28.0%
EconWhichMBA            Education                  37.6%                                   3.3%
EconWhichMBA            Business                   32.1%                                   28.0%
PostSports              Sports                     88.5%                                   9.6%
PostHealthSci           Science                    34.5%                                   0.96%
PostHealthSci           Health                     25.1%                                   5.3%
WaPoFood                Food                       60.3%                                   6.3%

While the topic-specific accounts of the news organizations have thousands to hundreds of thousands of followers, a much larger number of users subscribe to the primary accounts. For instance, the primary account @nytimes has 15 million followers, while the topic-specific accounts @NYTSports and @nytimesbusiness have 51K and 567K followers respectively. Since most social media users consume the diet produced by the primary account, we compare the social media diet produced by the primary account with the mass media diet of the same news organization.

Footnote 3: The statistics presented in this section are for the same three days in December 2014, over which the mass media diets were analyzed in the previous section. However, we observed that the information diets remain relatively unchanged over longer time durations.

Figure 1 compares the information diets produced by the three news organizations over mass media, with those produced by their primary Twitter accounts over social media. We find two main differences between the mass media and social media diets of the same news organization.
First, the primary accounts of the news organizations in social media tend to publish less content (as compared to the corresponding mass media diets) on those topics for which there exist topic-specific accounts. For instance, for both NYTimes and Washington Post, topics such as sports and food are covered much less in the social media diets than in the corresponding mass media diets. Additionally, both the primary and the topic-specific social media accounts of the news organizations tend to be more specialized in their production by focusing on fewer topics, as compared to their mass media diets. For example, while the mass media diet of Economist focuses on both business and politics, the social media diet of @economist focuses solely on business and publishes far less content on politics.

In summary, there is an unbundling of content on social media by the news organizations through multiple accounts, each specializing in a particular topic. This unbundling would enable users in social media to get focused information on their topics of interest by subscribing to the topic-specific accounts. However, the users who subscribe to only the primary account of the news organizations might not be aware that they are receiving a different information diet as compared to that of the mass media versions.

Popular social media accounts vs. mass media

Next, we study whether our observations about the specialized production of the social media accounts of news organizations generalize to other popular user-accounts in Twitter.

Similar resources


A Knowledge Management Approach to Discovering Influential Users in Social Media

A key step for success of marketer is to discover influential users who diffuse information and their followers have interest to this information and increase to diffuse information on social media. They can reduce the cost of advertising, increase sales and maximize diffusion of information.  A key problem is how to precisely identify the most influential users on social networks. In this pape...


Mass Media vs. the Mass of Media: A Study on the Human Nodes in a Social Network and their Chosen Messages

In Internet-based social networks, the nodes have the most pivotal role in the processes and outcomes of the networks. Whether they pay attention to a message in the network or ignore it defines the fate of the message. One message is shared and re-shared by millions of users and another is left forgotten. The current study tries to shed light on one aspect of the role of the users in a social ...


Someato: characterizing and exploiting behavior and interests of users in social media

Characterizing and understanding behavior and interests of social media members is crucial for enhancing several practical applications in this environment, such as target marketing or recommendation. Aiming to characterize and exploit such behavior and interests, this work proposes Someato, a novel online analytics tool. Besides a characterization methodology, Someato comprises a search and a ...


Social Media and Information Consumption Diversity

Social media platforms are having a profound impact on the so-called information ecosystem, specifically on how information is produced, distributed and consumed. Social media in particular has contributed to the rise of user generated content and consequently to a greater diversity in online content. On the other hand, social media networks, such as Twitter or Facebook, have become information...



Publication date: 2015